show the code
pacman::p_load(jsonlite, tidygraph, ggraph,
visNetwork, graphlayouts, ggforce,
skimr, tidytext, tidyverse,igraph,grid,gridExtra,DT,RColorBrewer)LIU YAN
June 17, 2023
FishEye International, a non-profit organization dedicated to combatting illegal, unreported, and unregulated (IUU) fishing, has been granted access to an international finance corporation’s comprehensive database on fishing-related companies. Based on prior investigations, FishEye has observed a strong correlation between companies exhibiting anomalous structures and their involvement in IUU activities or other suspicious practices. To facilitate their efforts, FishEye has transformed the database into a knowledge graph, encompassing extensive information about the companies, owners, workers, and financial status.

The objective of this exercise is to use visual analytics to identify anomalies in the business groups present in the knowledge graph.
This exercise source is from: Mini-Challenge 3 task 1
The code chunk below will be used to install and load the necessary R packages to meet the data preparation, data wrangling, data analysis and visualisation needs.
jsonlite: A simple and robust JSON parser and generator for R.
tidygraph: this package provides a tidy API for graph/network manipulation.
ggraph: ggiraph is a tool that allows you to create dynamic ggplot graphs.
visNetwork: an R package for network visualization, using vis.js javascript library.
graphlayouts: Several new layout algorithms to visualize networks are provided which are not part of ‘igraph’.
ggforce: aims to be a collection of mainly new stats and geoms that facilities for composing specialised plots.
skimr: is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors.
tidytext: make many text mining tasks easier, more effective, and consistent with tools already in wide use.
tidyverse: A collection of core packages designed for data science, used extensively for data preparation and wrangling.
igraph:igraph is a fast and open source library for the analysis of graphs or networks.
grid: provides a set of functions and classes that represent graphical objects.
gridExtra:provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
DT: provides an R interface to the JavaScript library DataTables.
RColorBrewer:provides color schemes for maps (and other graphics) designed by Cynthia Brewer.
In the code chunk below, fromJSON() of jsonlite package is used to import MC3.json into R environment.
The output is called mc3_data. It is a large list R object.The knowledge graph (KG) encompasses a vast network comprising 27,622 nodes and 24,038 edges, forming 7,794 interconnected components. Notably, this KG is an undirected multi-graph, and at the top-level, the graph is represented as a dictionary with graph-level properties.
Node Attributes:
type – Type of node as defined above.
country – Country associated with the entity. This can be a full country or a two-letter country code.
product_services – Description of product services that the “id” node does.
revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.
id – Identifier of the node is also the name of the entry.
role – The subset of the “type” node, not in every node attribute.
dataset – Always “MC3”.
Edge Attributes:
type – Type of the edge as defined above.
source – ID of the source node.
target – ID of the target node.
dataset – Always “MC3”.
role - The subset of the “type” node, not in every edge attribute.
In this section, we undertake data exploration techniques to enhance our understanding of the dataset.
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
# Convert links data to tibble format
mc3_edges <- as_tibble(mc3_data$links) %>%
distinct() %>% # Remove duplicate edges
mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
# Group edges by source, target, and type
group_by(source, target, type) %>%
# Calculate the number of edges for each source/target/type combination
summarise(weights = n()) %>%
filter(source!=target) %>%
ungroup()distinct() is used to ensure that there will be no duplicated records.
mutate() and as.character() are used to convert the field data type from list to character.
group_by() and summarise() are used to count the number of unique links.
thefilter(source!=target)is to ensure that no record with same source and target.
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
# Convert nodes data to tibble format
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)),
type = as.character(type)) %>%
# Select specific columns for the resulting tibble
select(id, country, type, revenue_omu, product_services)mutate() and as.character() are used to convert the field data type from list to character.
To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.
select() is used to re-organise the order of the fields.
Based on the table provided, it is apparent that certain observations have product services that are unrelated to the fishing industry. For example, Jones LLC(id) offers product services related to automobiles rather than fishing.
From section 2.3.1, a notable observation was made regarding the presence of numerous nodes whose product_service is unrelated to the fishing industry. To ensure the effectiveness of subsequent analyses, it is imperative to filter out these irrelevant nodes. Thus, we will employ text sensing techniques and utilize the tidytext package to analyze the keywords present in the product_service attribute of the nodes.
In this section, tokenisation will be used in the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.
In the code chunk below, unnest_token() of tidytext is used to split text in product_services field into words.
By default, punctuation has been stripped.
By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).
The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
the tidytext package has a function called stop_words that will help us clean up stop words.
# New words to be added
new_words <- data.frame(word = c("unknown", "character","0", "products","equipment","services","accessories","related"))
# Combine existing stop words with new words
updated_stop_words <- bind_rows(stop_words, new_words)
stopwords_removed <- token_nodes %>%
anti_join(updated_stop_words)Create new stop words and update the stop_words by bind_rows
Then anti_join() of dplyr package is used to remove all stop words from the analysis.
We can visualise the words extracted by using the code chunk below.

count() counts the frequency of each unique word in the ‘word’ column, sorting the result in descending order.
top_n(15) selects the top 15 words with the highest frequency.
mutate() reorders the ‘word’ column based on the frequency (‘n’) of each word.
coord_flip() flips the coordinates, resulting in a horizontal bar plot.
Based on the above bar chart, it is evident that three fishing related specific keywords, namely “fish”,“seafood”,“salmon” exhibit relatively higher percentages compared to other words. Consequently, we will employ these three key words as filters to select nodes and edges for further analysis in subsequent steps.
In this section, we will create an analysis subset of nodes and edges specifically related to the terms “fish” ,“seafood”,“salmon”.
It is important to note that during our investigation, we discovered that only the “source” nodes in the mc3_edges dataset could be found in the mc3_nodes dataset. Therefore, when filtering for fishing-related edges, we will focus on the “source” node’s product_service attribute, specifically checking if it contains the keywords of interest. This filtered subset of edges will be referred to as mc3_edges_fishing.
Regarding the extraction of nodes, we will utilize the mc3_edges_fishing dataset. From this dataset, we will extract both the “source” and “target” node IDs. For the “source” nodes, we will apply a filter based on whether their product_service attribute contains the specified keywords. As for the “target” nodes, they will inherit the type attribute from the mc3_edges_fishing dataset.
By following this approach, we will establish a refined analysis dataset consisting of relevant nodes and edges, enabling further examination of the fishing domain.
# Filter edges based on product_service criteria
mc3_edges_fishing <- mc3_edges %>%
filter(
source %in% mc3_nodes$id[grep("fishing|seafood|salmon", mc3_nodes$product_services, ignore.case = TRUE)]) %>%
distinct()
# Extract relevant node IDs from mc3_edges_fishing
related_node_ids <- unique(c(mc3_edges_fishing$source, mc3_edges_fishing$target))
# Filter nodes based on related_node_ids and product_service criteria
mc3_nodes_source <- mc3_nodes %>%
filter(
id %in% related_node_ids &
grepl("fishing|seafood|salmon", product_services, ignore.case = TRUE)
)
# mc3_nodes2 now contains the filtered nodes
mc3_nodes_target <- mc3_edges_fishing %>%
select(target, type) %>%
rename(id = target)
# Create an empty dataframe containing all variables
empty_df <- data.frame(id = character(),country=character(), type = character(), revenue_omu=numeric(), product_services=character(),stringsAsFactors = FALSE)
# Fill in missing columns to match the number of columns
mc3_nodes_source <- bind_rows(mc3_nodes_source, empty_df)
mc3_nodes_target <- bind_rows(empty_df, mc3_nodes_target)
# Merge mc3_nodes_source and mc3_nodes_target
mc3_nodes_fishing <- bind_rows(mc3_nodes_source, mc3_nodes_target)
# Aggregated MC3_ Nodes_ Fishing1
mc3_nodes_fishing <- mc3_nodes_fishing %>%
group_by(id, country, type)%>%
summarise(revenue_omu = sum(revenue_omu))%>%
ungroup()grep("pattern", x, ignore.case = TRUE) searches for the specified “pattern” in the vector or character string “x” and returns the matching elements.
unique() returns the unique elements of the vector “x” by removing any duplicates.
distinct() removes duplicate rows from a data frame, keeping only the unique rows.
bind_rows() combines multiple data frames or tibbles by stacking them vertically to create a new data frame.
To analyze the network connection structure, we construct a tbl_graph object by utilizing the newly created mc3_nodes_fishing and mc3_edges_fishing datasets. This tbl_graph object represents the network structure and allows us to perform various network analysis tasks and explorations.
Degree centrality is a measure that quantifies the number of edges connected to a specific node in a network. It serves as an indicator of the node’s influence or centrality within the network, as nodes with a higher degree centrality exhibit a larger number of connections. In the context of the mc3_graph_fishing network, the degree centrality represents the number of companies or people that are connected to a particular node.
In order to capture this information, we calculate the degree centrality for each node in the mc3_graph_fishing network. The resulting degree centrality values are then stored as a variable within the mc3_nodes_fishing dataset.
# Calculate the node degree centrality for mc3_graph_fishing
node_degrees <- degree(mc3_graph_fishing)
# Create a new column "node_degrees" for mc3_nodes_fishing
mc3_nodes_fishing <- mc3_nodes_fishing %>%
mutate(degree = node_degrees)
# Plot Bar chart
ggplot(mc3_nodes_fishing, aes(x = degree)) +
geom_bar(fill = "#808de8", color = "black") +
labs(title = "Bar Chart of Degree Centrality",
x = "Degree Centrality",
y = "Frequency")
Closeness centrality is a measure that quantifies how close a node is to all other nodes in a network.In the context of the mc3_graph_fishing network, we compute the closeness centrality for each node. This involves determining the average shortest path length from a given node to all other nodes in the network. Nodes with shorter average shortest path lengths will exhibit higher closeness centrality scores.
The computed closeness centrality values are then stored as a variable within the mc3_nodes_fishing dataset.
# Calculate closeness centrality for mc3_graph_fishing
closeness_values <- mc3_graph_fishing %>%
pull(closeness_centrality)
# Create a new column "closeness_centralit" for mc3_nodes_fishing
mc3_nodes_fishing <- mc3_nodes_fishing %>%
mutate(closeness_centrality = closeness_values)
# Plot histogram
ggplot(mc3_nodes_fishing, aes(x = closeness_centrality)) +
geom_histogram(binwidth = 0.02, fill = "#808de8", color = "black") +
labs(title = "Histogram of Closeness Centrality",
x = "Closeness Centrality",
y = "Frequency")
Given that the network consists of three types of nodes: “Company,” “Company Contact,” and “Beneficial Owner,” we can calculate the following counts for each node:
• The number of “Company” nodes connected to it.
• The number of “Company Contact” nodes connected to it.
• The number of “Beneficial Owner” nodes connected to it.
# Initialize variables to store the counts
company_qty <- vector("integer", length = vcount(mc3_graph_fishing))
contacts_qty <- vector("integer", length = vcount(mc3_graph_fishing))
owner_qty <- vector("integer", length = vcount(mc3_graph_fishing))
# Iterate over each node
for (i in 1:vcount(mc3_graph_fishing)) {
node <- V(mc3_graph_fishing)[i]
neighbors <- neighbors(mc3_graph_fishing, node, mode = "all")
# Count the number of each type of node among the neighbors
company_qty[i] <- sum(V(mc3_graph_fishing)[neighbors]$type == "Company")
contacts_qty[i] <- sum(V(mc3_graph_fishing)[neighbors]$type == "Company Contacts")
owner_qty[i] <- sum(V(mc3_graph_fishing)[neighbors]$type == "Beneficial Owner")
}
# Add the counts as new columns to the node data frame
mc3_nodes_fishing$company_qty <- company_qty
mc3_nodes_fishing$contacts_qty <- contacts_qty
mc3_nodes_fishing$owner_qty <- owner_qtyvector() Creates an empty vector with a specified length or type.
V() is a function used to access the nodes (vertices) of a graph. In this code, V(mc3_graph_fishing) retrieves all the nodes in the mc3_graph_fishing graph.
sum() Calculates the sum of values.
To facilitate this analysis, we create three new variables: “company_qty,” “contacts_qty,” and “owner_qty” to store the respective counts of connected nodes for each node in the network. These variables will provide valuable insights into the connectivity patterns between different types of nodes and enable further examination of the network’s structure and relationships.
# Create each plot
plot1 <- ggplot(data = mc3_nodes_fishing, aes(x = company_qty)) + geom_bar(fill="#808de8",color = "black")+
ggtitle("Number of Companies Connected to Each Node")+
theme(plot.title = element_text(size = 7))
plot2 <- ggplot(data = mc3_nodes_fishing, aes(x = contacts_qty)) + geom_bar(fill="#808de8", color = "black")+
ggtitle("Number of Company Contacts Connected to Each Node")+
theme(plot.title = element_text(size = 7))
plot3 <- ggplot(data = mc3_nodes_fishing, aes(x = owner_qty)) + geom_bar(fill="#808de8", color = "black")+
ggtitle("Number of Beneficial Owners Connected to Each Node")+
theme(plot.title = element_text(size = 10))
# Combine the plots with modified layout
combined_plot <- grid.arrange(
arrangeGrob(plot1, plot2, ncol = 2),
plot3,
heights = c(2, 1)
)
In this section, we will generate the network representation of nodes and edges related to the fishing industry. Subsequently, we will conduct an analysis of the network, with a specific focus on highlighting the anomalies within the network.
Given that we have previously extracted the nodes and edges related to the fishing industry in section 2.4, we can now proceed to visualize the network.
However, since our primary focus is on identifying potential connections related to illegal fishing, we will exclude nodes and edges that exhibit a one-to-one pattern. These one-to-one patterns typically represent safe connections. To achieve this, we will apply a filter condition using ( degree == 1 & closeness_centrality == 1) to exclude such nodes and edges from our analysis.
# Create the network object mc3_graph_fishing
mc3_graph_fishing <- tbl_graph(nodes = mc3_nodes_fishing,
edges = mc3_edges_fishing,
directed = FALSE) %>%
# Calculate the degree centrality and closeness centrality
mutate(degree = centrality_degree(),
closeness_centrality = centrality_closeness())
mc3_graph_fishing %>%
# filter out the nodes that degree==1 & closeness centrality ==1
filter(!(degree == 1 & closeness_centrality == 1)) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = 0.5)) +
geom_node_point(aes(
size = degree, # set the node size
color = type, # set the node color
alpha = 0.5)) +
scale_size_continuous(range = c(1, 10)) +
theme_graph()
ggraph(layout = "fr") initiates the ggraph plotting system from the ggraph package with the Fruchterman-Reingold layout specified by layout = “fr”.
scale_size_continuous(range = c(1, 10)) adjusts the size range of the nodes using the scale_size_continuous() function. It specifies that the node sizes should range from 1 to 10.
Based on the findings from section 2.4.3, which examined the count of connection objects in the data, we observed that the majority of nodes in the network were associated with either zero or one company connection. A smaller number of nodes were connected to two companies, and only a very limited number of nodes were connected to three companies.
Therefore, we can infer that nodes, especially the “beneficial_owner_nodes” connected to three companies have a higher potential to be classified as anomalies within the network.
# Filter nodes with type "Beneficial Owner"
beneficial_owner_nodes <- mc3_nodes_fishing[mc3_nodes_fishing$type == "Beneficial Owner", ]
# Create bar chart for "company_qty"
plot <- ggplot(beneficial_owner_nodes, aes(x = company_qty)) +
geom_bar(fill = "#808de8", color = "black") +
ggtitle("Company Quantity for Beneficial Owner Nodes")
# Print the plot
print(plot)
By referring to the data table for mc3_nodes_fishing, it becomes evident that there is only one node, identified as “James Brown” that exhibits connections with three different companies. This unique characteristic of James Brown’s node suggests a higher likelihood of it being an anomaly within the network.
# Create a new column 'Index' with row numbers
mc3_nodes_fishing$Index <- seq_len(nrow(mc3_nodes_fishing))
# Select the nodes based on the given indices
selected_indices <- c(697, 73, 670, 426, 1623, 322, 58, 977, 327, 797, 302, 657, 727, 641, 918, 1230, 1712, 627, 217, 814, 73, 1798, 1366, 61, 897, 1217, 222)
selected_nodes <- mc3_nodes_fishing[mc3_nodes_fishing$Index %in% selected_indices, ]
selected_id <- selected_nodes$id
# Find the edges where both source and target nodes are from selected nodes' IDs
selected_edges <- mc3_edges_fishing[mc3_edges_fishing$source %in% selected_id & mc3_edges_fishing$target %in% selected_id, ]
colnames(selected_edges)[1:2] <- c("from", "to")
# Assuming 'selected_nodes' and 'selected_edges' are your existing data frames
# Add a new column 'size' based on the 'degree' column, scaled by a factor
selected_nodes$size <- selected_nodes$closeness_centrality * 1000
# Create a color palette based on the unique values in the 'type' column
palette <- brewer.pal(n = length(unique(selected_nodes$type)), name = "Set3")
# Assign colors based on the 'type' column
selected_nodes$color <- palette[as.factor(selected_nodes$type)]
# Create the visNetwork plot
g <- visNetwork(nodes = selected_nodes, edges = selected_edges, height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visNodes(color = ~color, size = ~size) %>%
visEdges(color = list(highlight = "lightgray")) %>%
visOptions(selectedBy = 'type',
highlightNearest = list(enabled = TRUE,
degree = 1,
hover = TRUE,
labelOnly = TRUE),
nodesIdSelection = TRUE) %>%
visLegend() %>%
visInteraction(navigationButtons = TRUE) %>%
visLayout(randomSeed = 123)
g# Convert the network object to igraph object
g <- as.igraph(mc3_graph_fishing)
# IMPORTANT ! set vertex names otherwise when you split in sub-graphs you won't be able to recognize them
g <- set.vertex.attribute(g,'name',index=V(g),as.character(1:vcount(g)))
# decompose the graph
sub.graphs <- decompose.graph(g)
# search for the sub-graph indexes containing 2 and 9
sub.graph.indexes <- which(sapply(sub.graphs,function(g) any(V(g)$name %in% c('697'))))
# merge the desired subgraphs
merged <- do.call(graph.union,sub.graphs[sub.graph.indexes])
plot(merged)
Upon observing the aforementioned visualization, it becomes apparent that the “James Brown” node exhibits a distinct pattern within the network. Specifically, it is connected to three separate company nodes, which stands out as a highly unique and notable characteristic within the overall network connections.
In section 3.2, “Network for Anomalies Nodes,” we utilized igraph to extract the network associated with the node “James Brown” However, we encountered a limitation where the id labels couldn’t be displayed next to the nodes. Consequently, we resorted to manually inputting the node’s id in the network to create new nodes and edges for generating the visNetwork plot. In future research, it is advisable to focus on either automating the creation of new nodes and edges or utilizing igraph to draw the network while incorporating id labels next to each node.
Regarding the anomalies nodes identified in the preceding sections, it is imperative to gather additional evidence to substantiate the involvement of anomalous companies in illegal fishing activities.
A visual analytics process should be developed to identify similar businesses and measure the similarity among the grouped businesses from the previous question. This entails leveraging visual analytics techniques and tools to explore and analyze the shared characteristics, patterns, or behaviors among the businesses in the identified groups.